Take-home_Ex1: Public Bus Passengers in Singapore

Author

Widya Tantiya Yutika

Published

November 29, 2023

Modified

December 3, 2023

1 Overview

The increasing digitization of urban infrastructures, including buses, taxis, mass rapid transit, public utilities and roads, has generated vast datasets capturing movement patterns over space and time. This data, facilitated by technologies such as GPS and RFID, offers valuable insights into human mobility within cities. Smart cards and GPS devices on public buses, for instance, have enabled the collection of routes and ridership data, providing a rich source for understanding urban movement.

Despite the wealth of data collected, its utilization often remains limited to basic tracking and mapping using Geographic Information System (GIS) applications. This limitation is attributed to the inadequacy of conventional GIS functions in effectively analyzing and modeling spatial and spatio-temporal data.

The objectives of this study are centered around employing Exploratory Spatial Data Analysis (ESDA) techniques, specifically Local Indicators of Spatial Association (LISA) and Emerging Hot Spot Analysis (EHSA), to unveil the spatial and spatio-temporal mobility patterns of public bus passengers in Singapore.

2 The Study Area and Data

2.1 Aspatial Data

The aspatial data used in this take-home exercise is extracted from LTA DataMall (Passenger Volume by Origin Destination Bus Stops) for the month of October 2023.

2.2 Geospatial Data

The geospatial data used in this take-home exercise are as follows.

  • Bus Stop Location from LTA DataMall, which provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.

  • Hexagon, a hexagonal grid layer with a 250m perpendicular distance between the centre of each hexagon and its edges, used to replace the relatively coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA.

3 Setting the Analytical Tools

Before getting started, I need to ensure that the sf, spdep, tmap, tidyverse, and knitr packages are installed in my R environment.

  • sf: for importing and handling geospatial data in R,

  • spdep: for computing spatial weights and global and local spatial autocorrelation statistics,

  • tmap: for preparing cartographic-quality choropleth maps,

  • tidyverse: for wrangling attribute data in R; tidyverse already includes a collection of packages such as readr, ggplot2, dplyr, tibble, purrr, etc., and

  • knitr: for facilitating dynamic report generation in R Markdown documents.

The code chunk below is used to ensure that the necessary R packages have been installed; any package that is yet to be installed will be installed and then loaded into the R environment.

pacman::p_load(sf, spdep, tmap, tidyverse, knitr)

4 Getting the Data into R Environment

4.1 Importing Shapefile into R Environment

The code chunk below uses st_read() of the sf package to import the BusStop shapefile into R. The imported shapefile will be a simple features object of sf.

busstop <- st_read(dsn = "data/geospatial", layer = "BusStop")
Reading layer `BusStop' from data source 
  `W:\widyayutika\ISSS624\Take-home_Exercise\Take-home_Ex1\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21

The code chunk below uses st_geometry() of sf package to display basic information of feature class.

st_geometry(busstop)
Geometry set for 5161 features 
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
First 5 geometries:

The code chunk below uses glimpse() of the dplyr package to display the data type of each field.

glimpse(busstop)
Rows: 5,161
Columns: 4
$ BUS_STOP_N <chr> "22069", "32071", "44331", "96081", "11561", "66191", "2338…
$ BUS_ROOF_N <chr> "B06", "B23", "B01", "B05", "B05", "B03", "B02A", "B02", "B…
$ LOC_DESC   <chr> "OPP CEVA LOGISTICS", "AFT TRACK 13", "BLK 239", "GRACE IND…
$ geometry   <POINT [m]> POINT (13576.31 32883.65), POINT (13228.59 44206.38),…

Next, I will plot the geospatial data using the code chunk below.

plot(busstop)

The glimpse() check above shows that BUS_STOP_N is of character type. It needs to be converted to factor type so that it can be treated as a categorical identifier when georeferencing the passenger volume data against the bus stop locations.

busstop$BUS_STOP_N <- as.factor(busstop$BUS_STOP_N)

Next, I will confirm that the data type of BUS_STOP_N has changed to factor using glimpse().

glimpse(busstop)
Rows: 5,161
Columns: 4
$ BUS_STOP_N <fct> 22069, 32071, 44331, 96081, 11561, 66191, 23389, 54411, 285…
$ BUS_ROOF_N <chr> "B06", "B23", "B01", "B05", "B05", "B03", "B02A", "B02", "B…
$ LOC_DESC   <chr> "OPP CEVA LOGISTICS", "AFT TRACK 13", "BLK 239", "GRACE IND…
$ geometry   <POINT [m]> POINT (13576.31 32883.65), POINT (13228.59 44206.38),…

4.2 Importing CSV File into R Environment

Next, I will import origin_destination_bus_202310.csv into R using read_csv() of the readr package. The output is a tibble data frame.

odbus <- read_csv("data/aspatial/origin_destination_bus_202310.csv")

The code chunk below uses glimpse() of the dplyr package to display the structure of the odbus tibble.

glimpse(odbus)
Rows: 5,694,297
Columns: 7
$ YEAR_MONTH          <chr> "2023-10", "2023-10", "2023-10", "2023-10", "2023-…
$ DAY_TYPE            <chr> "WEEKENDS/HOLIDAY", "WEEKDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR       <dbl> 16, 16, 14, 14, 17, 17, 17, 7, 14, 14, 10, 20, 20,…
$ PT_TYPE             <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE      <chr> "04168", "04168", "80119", "80119", "44069", "2028…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "2014…
$ TOTAL_TRIPS         <dbl> 3, 5, 3, 5, 4, 1, 24, 2, 1, 7, 3, 2, 5, 1, 1, 1, 1…

The glimpse() check above shows that ORIGIN_PT_CODE and DESTINATION_PT_CODE are of character type. Both need to be converted to factor type so that they can be treated as categorical identifiers when georeferencing against the bus stop location data.

odbus$ORIGIN_PT_CODE <- as.factor(odbus$ORIGIN_PT_CODE)
odbus$DESTINATION_PT_CODE <- as.factor(odbus$DESTINATION_PT_CODE) 

I will also change the data type of TIME_PER_HOUR and TOTAL_TRIPS from <dbl> to <int> because both of these fields should be represented as whole numbers.

odbus$TIME_PER_HOUR <- as.integer(odbus$TIME_PER_HOUR)
odbus$TOTAL_TRIPS <- as.integer(odbus$TOTAL_TRIPS) 

Next, I will confirm the data type for ORIGIN_PT_CODE and DESTINATION_PT_CODE have changed to factor using glimpse().

glimpse(odbus)
Rows: 5,694,297
Columns: 7
$ YEAR_MONTH          <chr> "2023-10", "2023-10", "2023-10", "2023-10", "2023-…
$ DAY_TYPE            <chr> "WEEKENDS/HOLIDAY", "WEEKDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR       <int> 16, 16, 14, 14, 17, 17, 17, 7, 14, 14, 10, 20, 20,…
$ PT_TYPE             <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE      <fct> 04168, 04168, 80119, 80119, 44069, 20281, 20281, 1…
$ DESTINATION_PT_CODE <fct> 10051, 10051, 90079, 90079, 17229, 20141, 20141, 1…
$ TOTAL_TRIPS         <int> 3, 5, 3, 5, 4, 1, 24, 2, 1, 7, 3, 2, 5, 1, 1, 1, 1…

5 Data Wrangling

5.1 Checking the Reference Coordinate System of Geospatial Data

A common issue when importing geospatial data into R is that the coordinate reference system of the source data is either missing (e.g. due to a missing .prj file for an ESRI shapefile) or wrongly assigned.

The code chunk below uses st_crs() of sf package to retrieve the coordinate reference system of busstop.

st_crs(busstop)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["WGS 84",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]

busstop is projected in SVY21, as shown in the second line of the output above. However, the CRS definition does not carry the correct EPSG identifier: the EPSG 9001 appearing in the last lines is merely the code for the metre length unit, while the correct EPSG code for SVY21, 3414, is missing. The CRS is therefore wrongly assigned.

5.2 Transforming the Projection

Next, I will assign the correct projected coordinate system to busstop, as my analysis will measure distance and/or area.

The code chunk below uses st_transform() of the sf package to convert the coordinates to EPSG code 3414.

busstop3414 <- st_transform(busstop, 3414)

Next, I will check the coordinate system after transformation with the code chunk below.

st_crs(busstop3414)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

As shown above, the projected CRS is now SVY21 / Singapore TM, and the last line now reports EPSG 3414.

5.3 Checking Duplicated Records

The code chunk below is used to check for duplicated records on odbus.

duplicate <- odbus %>%
  group_by_all() %>%
  filter(n()>1) %>%
  ungroup()

The duplicate tibble returned by the code chunk above has zero rows, indicating that no duplicate records were found.
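As a minimal illustration of how this duplicate check behaves (on a hypothetical toy tibble, not the odbus data), all rows belonging to a duplicated group are retained by filter(n() > 1), so a non-zero row count signals duplicates:

```r
library(dplyr)

# Toy tibble with one duplicated row (hypothetical data)
df <- tibble::tibble(code = c("A", "A", "B"), trips = c(3, 3, 5))

dup <- df %>%
  group_by_all() %>%      # group by every column, i.e. by full rows
  filter(n() > 1) %>%     # keep rows whose full-row group appears more than once
  ungroup()

nrow(dup)   # 2: both copies of the duplicated row are returned
```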

5.4 Remove Unnecessary Fields

First, I check the unique values of YEAR_MONTH, DAY_TYPE, and PT_TYPE by using table() to create a frequency table of each categorical field.

YEAR_MONTH_counts <- table(odbus$YEAR_MONTH)
print(YEAR_MONTH_counts)

2023-10 
5694297 
DAY_TYPE_counts <- table(odbus$DAY_TYPE)
print(DAY_TYPE_counts)

         WEEKDAY WEEKENDS/HOLIDAY 
         3259419          2434878 
PT_TYPE_counts <- table(odbus$PT_TYPE)
print(PT_TYPE_counts)

    BUS 
5694297 

From the results above, YEAR_MONTH and PT_TYPE each contain only a single value and therefore carry no information for the analysis, so I will exclude them using the code chunk below.

odbus <- odbus[, !(names(odbus) %in% c("YEAR_MONTH", "PT_TYPE"))]

I will then use glimpse() to ensure the process is done correctly.

glimpse(odbus)
Rows: 5,694,297
Columns: 5
$ DAY_TYPE            <chr> "WEEKENDS/HOLIDAY", "WEEKDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR       <int> 16, 16, 14, 14, 17, 17, 17, 7, 14, 14, 10, 20, 20,…
$ ORIGIN_PT_CODE      <fct> 04168, 04168, 80119, 80119, 44069, 20281, 20281, 1…
$ DESTINATION_PT_CODE <fct> 10051, 10051, 90079, 90079, 17229, 20141, 20141, 1…
$ TOTAL_TRIPS         <int> 3, 5, 3, 5, 4, 1, 24, 2, 1, 7, 3, 2, 5, 1, 1, 1, 1…

6 Creating honeycomb_grid

A honeycomb grid is preferred over the relatively coarse and irregular Master Plan 2019 Sub-zone GIS data set of URA because hexagons reduce sampling bias thanks to their low perimeter-to-area ratio and their ability to form an evenly spaced grid. Honeycomb grids are also well suited to approximating circular areas, making them suitable for mapping Singapore's irregular boundary.

The code chunk below uses st_make_grid() of the sf package to create a hexagonal (honeycomb) grid with a 250m perpendicular distance between the centre of each hexagon and its edges. According to the R documentation, cellsize is the distance between opposite edges, i.e. twice that perpendicular distance. Thus, for the purpose of this exercise, I will use a cellsize of 500m and set square = FALSE for a hexagonal grid. After doing so, I will create a grid_id for each hexagonal grid.

area_honeycomb_grid = st_make_grid(busstop3414, c(500, 500), what = "polygons", square = FALSE)    
# To sf and add grid ID  
honeycomb_grid_sf = st_sf(area_honeycomb_grid) %>%    
  # add grid ID      
  mutate(grid_id = 1:length(lengths(area_honeycomb_grid)))
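The 250m/500m relationship can be sanity-checked with a little regular-hexagon geometry (pure arithmetic, independent of the data): the side length and cell area follow directly from the apothem.

```r
# cellsize in st_make_grid() is the distance between opposite edges,
# i.e. twice the apothem (perpendicular distance from centre to an edge)
apothem  <- 250
cellsize <- 2 * apothem                  # 500 m, as used above

side     <- 2 * apothem / sqrt(3)        # regular-hexagon side length, ~288.68 m
area_m2  <- (3 * sqrt(3) / 2) * side^2   # ~216,506 m2 per hexagonal cell
```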

7 Extracting the Study Data

In this exercise, I will extract the commuting flows during peak hours as follows.

Bus tap-on time windows:

  • Weekday morning peak: 6am to 9am

  • Weekday afternoon peak: 5pm to 8pm

  • Weekend/holiday morning peak: 11am to 2pm

  • Weekend/holiday evening peak: 4pm to 7pm

7.1 Weekday Morning Peak

The code chunk below will be used to extract the weekday morning peak (Weekday: 6-9am) records and calculate the passenger trips generated by each origin bus stop, using group_by() of the dplyr package and aggregating the values with summarise() to sum up TOTAL_TRIPS. The mutate() in the code below ensures that ORIGIN_PT_CODE remains of factor data type after the group_by().

odbus_weekday_6_9 <- odbus %>%
  filter(DAY_TYPE == "WEEKDAY") %>%
  filter(TIME_PER_HOUR >= 6 &
           TIME_PER_HOUR <= 9) %>%
  group_by(ORIGIN_PT_CODE) %>%
  summarise(TRIPS = sum(TOTAL_TRIPS))%>%
  mutate(ORIGIN_PT_CODE = as.factor(ORIGIN_PT_CODE))

I will repeat the processes above for the other peak hours as shown below.

7.2 Weekday Afternoon Peak

The code chunk below will be used to extract the weekday afternoon peak (Weekday: 5-8pm) and calculate the passenger trips in each origin bus stop.

odbus_weekday_17_20 <- odbus %>%   
  filter(DAY_TYPE == "WEEKDAY") %>%   
  filter(TIME_PER_HOUR >= 17 &            
           TIME_PER_HOUR <= 20) %>%   
  group_by(ORIGIN_PT_CODE) %>%   
  summarise(TRIPS = sum(TOTAL_TRIPS)) %>%
  mutate(ORIGIN_PT_CODE = as.factor(ORIGIN_PT_CODE))

7.3 Weekends/Holiday Morning Peak

The code chunk below will be used to extract the weekend/holiday morning peak (Weekend/holiday: 11am-2pm) and calculate the passenger trips in each origin bus stop.

odbus_weekend_11_14 <- odbus %>% 
  filter(DAY_TYPE == "WEEKENDS/HOLIDAY") %>%   
  filter(TIME_PER_HOUR >= 11 &            
           TIME_PER_HOUR <= 14) %>%   
  group_by(ORIGIN_PT_CODE) %>%   
  summarise(TRIPS = sum(TOTAL_TRIPS)) %>%
  mutate(ORIGIN_PT_CODE = as.factor(ORIGIN_PT_CODE))

7.4 Weekends/Holiday Evening Peak

The code chunk below will be used to extract the weekend/holiday evening peak (Weekend/holiday: 4-7pm) and calculate the passenger trips in each origin bus stop.

odbus_weekend_16_19 <- odbus %>%   
  filter(DAY_TYPE == "WEEKENDS/HOLIDAY") %>%   
  filter(TIME_PER_HOUR >= 16 &            
           TIME_PER_HOUR <= 19) %>%   
  group_by(ORIGIN_PT_CODE) %>%   
  summarise(TRIPS = sum(TOTAL_TRIPS)) %>%
  mutate(ORIGIN_PT_CODE = as.factor(ORIGIN_PT_CODE))

8 Further Data Wrangling for Weekday Morning Peak Hour: A Step-by-Step Guide

This section provides a comprehensive step-by-step walkthrough for calculating the number of trips within each hexagonal grid during the Weekday Morning Peak Hour, with a plan to replicate the same process for the Weekday Afternoon Peak, Weekends/Holiday Morning Peak, and Weekends/Holiday Evening Peak in the subsequent section.

8.1 Performing Relational Join

The code chunk below will be used to join the busstop3414 sf data frame and odbus_weekday_6_9, matching BUS_STOP_N of busstop3414 with ORIGIN_PT_CODE of odbus_weekday_6_9, using left_join() of the dplyr package. As this take-home exercise focuses on passenger trips generated by origin bus stop, I will remove the rows for bus stops that solely serve as destinations, which are indicated by NA values in the corresponding TRIPS field, using filter() from the dplyr package.

total_trips_per_busstop_wdmp <- left_join(busstop3414, odbus_weekday_6_9, by = c("BUS_STOP_N" = "ORIGIN_PT_CODE")) %>%
  filter(!is.na(TRIPS))

8.2 Spatial Join with Hexagonal Honeycomb Grid and Calculating Total Trips in a Hexagonal Grid

The code chunk below will be used to join total_trips_per_busstop_wdmp and the honeycomb grid spatially using st_join() from the sf package, and to remove the hexagonal grids without any bus stop, which are indicated by NA values in BUS_STOP_N. Next, I will calculate the total trips in each hexagonal grid using group_by() and summarise() from the dplyr package.

total_trips_per_grid_wdmp <- st_join(honeycomb_grid_sf,total_trips_per_busstop_wdmp) %>%
  filter(!is.na(BUS_STOP_N))%>%
  group_by(grid_id) %>%
  summarise(total_trips = sum(TRIPS))

8.3 Replicating the Steps for Other Peak Hours

8.3.1 Weekday Afternoon Peak

total_trips_per_busstop_wdap <- left_join(busstop3414, odbus_weekday_17_20, by = c("BUS_STOP_N" = "ORIGIN_PT_CODE"))%>%
  filter(!is.na(TRIPS))

total_trips_per_grid_wdap <- st_join(honeycomb_grid_sf,total_trips_per_busstop_wdap) %>%
  filter(!is.na(BUS_STOP_N))%>%
  group_by(grid_id) %>%
  summarise(total_trips = sum(TRIPS))

8.3.2 Weekends/Holiday Morning Peak

total_trips_per_busstop_wemp <- left_join(busstop3414, odbus_weekend_11_14, by = c("BUS_STOP_N" = "ORIGIN_PT_CODE"))%>%
  filter(!is.na(TRIPS))

total_trips_per_grid_wemp <- st_join(honeycomb_grid_sf,total_trips_per_busstop_wemp) %>%
  filter(!is.na(BUS_STOP_N))%>%
  group_by(grid_id) %>%
  summarise(total_trips = sum(TRIPS))

8.3.3 Weekends/Holiday Evening Peak

total_trips_per_busstop_weep <- left_join(busstop3414, odbus_weekend_16_19, by = c("BUS_STOP_N" = "ORIGIN_PT_CODE"))%>%
  filter(!is.na(TRIPS))

total_trips_per_grid_weep <- st_join(honeycomb_grid_sf,total_trips_per_busstop_weep) %>%
  filter(!is.na(BUS_STOP_N))%>%
  group_by(grid_id) %>%
  summarise(total_trips = sum(TRIPS))

9 Geovisualisation and Analysis

After the data preparation, I will first plot the distribution of passenger trips using ggplot() of the tidyverse package. To consolidate these distributions into a single plot, it is necessary to introduce a grouping variable in each dataframe. Within this unified plot, key summary statistics, including Q1, median, Q3, and the mean (highlighted in red), will be presented. This approach aims to provide a comprehensive comparison of passenger trip characteristics across the peak hour groups.

# Add a grouping variable to each dataframe
total_trips_per_grid_wdmp$Peak_Hour_Group <- "wdmp"
total_trips_per_grid_wdap$Peak_Hour_Group <- "wdap"
total_trips_per_grid_wemp$Peak_Hour_Group <- "wemp"
total_trips_per_grid_weep$Peak_Hour_Group <- "weep"

# Combine dataframes into a single dataframe
combined_peak_hour <- rbind(total_trips_per_grid_wdmp, total_trips_per_grid_wdap, total_trips_per_grid_wemp, total_trips_per_grid_weep) %>%
  filter(!is.na(total_trips))

# Create a box plot using ggplot
ggplot(combined_peak_hour, aes(x = Peak_Hour_Group, y = total_trips)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 18, size = 3, color = "red")+
  stat_boxplot(geom = 'errorbar', width = 0.5, color = 'blue', size = 1)+
  stat_summary(
    geom = 'text',
    fun.min = function(x) quantile(x, 0.25),
    fun = median,
    fun.max = function(x) quantile(x, 0.75),
    aes(label = sprintf("Q3: %.2f\nMedian: %.2f\nQ1: %.2f", ..ymax.., ..y.., ..ymin..)),
    vjust = -5,
    hjust= -0.08,
    position = position_dodge(width = 0.75),
    size = 3
  ) +
  stat_summary(geom = 'text', fun = mean, aes(label = sprintf("Mean: %.2f", ..y..)), vjust = -3, hjust=-0.08 ,  position = position_dodge(width = 0.9), color = 'red', size =3)+
  
  labs(title = "Boxplot for Total Trips in each Peak Hour Group", x="Peak Hour Group",y = "Passenger Trips") +
  theme_minimal()

The box plot analysis above reveals clear patterns in Singapore's bus passenger trips. Notably, the mean passenger trips during weekdays significantly surpass those on weekends and holidays, suggesting higher demand for bus services during typical workdays. This aligns with the common observation that commuting is more crowded on workdays, reflecting the daily hustle and bustle of the workforce.

Furthermore, the observation that the mean of every peak hour group exceeds its median indicates a right-skewed distribution. This skewness implies that most grids record a relatively small number of passenger trips during peak hours, with occasional instances of significantly higher demand. This distribution pattern underscores the challenges faced by commuters during peak hours, where a substantial portion of bus rides may experience higher congestion.

Some hexagonal grids record more than 400,000 passenger trips, highlighting specific areas with exceptionally high demand during weekdays. These areas likely represent key commuting regions with concentrated commercial and residential activity, and are possibly areas not easily accessible by MRT. To gain clearer insights into the concentrations of high passenger trips, I will leverage the tmap package.

9.1 Weekday Morning Peak

tmap_mode("view")

tm_shape(total_trips_per_grid_wdmp) +
  tm_fill(
    col = "total_trips",
    palette = c("#C5FFF8", "#FF4B91"),
    style = "cont",
    title = "Number of Trips",
    id = "grid_id",
    showNA = FALSE,
    alpha = 0.6,
    popup.vars = c("grid_id","total_trips"),
    popup.format = list(
      grid_id = list(format = "f", digits = 0),
      total_trips = list(format = "f", digits = 0))
  ) +
  tm_borders(col = "grey40", lwd = 0.7)
tmap_mode("plot")

It has been observed that certain bus stops are located outside the boundaries of Singapore, particularly in Johor. The influx of weekday morning peak passenger trips from Johor is high, surpassing 100,000 for October 2023. This significant volume reflects the cross-border commuting activity between Johor and Singapore, suggesting a preference among some individuals, possibly for cost considerations, to reside in Johor while working in Singapore.

The Central Business District (CBD) area displays comparatively low passenger trips generated by origin bus stops. This is attributed to the absence of major bus interchanges within the CBD, suggesting that commuters in this central business hub may rely on alternative modes of transportation, such as the Mass Rapid Transit (MRT) system. In addition, the lower passenger trips may also be influenced by the fact that bus routes typically do not commence within the CBD area, even though many bus routes pass through it.

Areas exhibiting a high concentration during the weekday morning peak are associated with prominent bus interchanges such as Woodlands and Boon Lay, each with more than 300,000 passenger trips on weekdays within a single month. Additionally, other bus interchanges, including Bishan, Ang Mo Kio, Toa Payoh, Clementi, Punggol, Tampines, and Bedok, also show high passenger trips.

9.2 Weekday Afternoon Peak

tmap_mode("view")

tm_shape(total_trips_per_grid_wdap) +
  tm_fill(
    col = "total_trips",
    palette = c("#C5FFF8", "#FF4B91"),
    style = "cont",
    title = "Number of Trips",
    id = "grid_id",
    showNA = FALSE,
    alpha = 0.6,
    popup.vars = c("grid_id","total_trips"),
    popup.format = list(
      grid_id = list(format = "f", digits = 0),
      total_trips = list(format = "f", digits = 0))
  ) +
  tm_borders(col = "grey40", lwd = 0.7)
tmap_mode("plot")

During the weekday afternoon peak, Woodlands and Boon Lay continue to exhibit an impressive volume of passenger trips, exceeding 400,000, compared with the patterns observed during the morning peak. In addition, areas such as Ang Mo Kio, Tampines, and Bedok emerge as significant contributors to passenger trips during this later peak period.

Typically, passenger volumes during the weekday afternoon peak are higher than during the weekday morning peak, with Boon Lay as the highest contributor at more than 500,000 passenger trips. Boon Lay serves as a transportation hub for workers, residents, and students from NTU. Moreover, the areas near Boon Lay are currently not accessible by MRT, as MRT development there is still in progress.

9.3 Weekends/Holiday Morning Peak

tmap_mode("view")
tm_shape(total_trips_per_grid_wemp) +
  tm_fill(
    col = "total_trips",
    palette = c("#C5FFF8", "#FF4B91"),
    style = "cont",
    title = "Number of Trips",
    id = "grid_id",
    showNA = FALSE,
    alpha = 0.6,
    popup.vars = c("grid_id","total_trips"),
    popup.format = list(
      grid_id = list(format = "f", digits = 0),
      total_trips = list(format = "f", digits = 0))
  ) +
  tm_borders(col = "grey40", lwd = 0.7)
tmap_mode("plot")

Given that weekends consist of only two days, the passenger trip numbers are lower in total but follow a pattern comparable to that recorded over the five weekdays. Notably, major bus interchanges such as Woodlands, Boon Lay, Bedok, and Tampines continue to play a pivotal role, contributing significantly to overall passenger trips during both weekdays and weekends.

During weekends and holidays, the morning peak (11am to 2pm) occurs later than the weekday morning peak (6 to 9am). This temporal shift can be attributed to social dynamics, with individuals engaging in lunchtime activities with family and friends during weekends. This distinctive pattern underscores the influence of social interactions on commuter behaviour during non-working days.

9.4 Weekends/Holiday Evening Peak

tmap_mode("view")
tm_shape(total_trips_per_grid_weep) +
  tm_fill(
    col = "total_trips",
    palette = c("#C5FFF8", "#FF4B91"),
    style = "cont",
    title = "Number of Trips",
    id = "grid_id",
    showNA = FALSE,
    alpha = 0.6,
    popup.vars = c("grid_id","total_trips"),
    popup.format = list(
      grid_id = list(format = "f", digits = 0),
      total_trips = list(format = "f", digits = 0))
  ) +
  tm_borders(col = "grey40", lwd = 0.7)

Similar to the other peak hours, the passenger trips mostly originate from major bus interchanges.

Typically, the afternoon/evening peak hours witness a higher volume of passenger trips than the morning peak hours, both on weekdays and weekends. This may be because individuals opt for the MRT during morning hours, as it provides a more punctual mode of transportation than the bus, which can be affected by congestion and potentially delay arrival at work, school, or leisure activities.

During weekends and holidays, the evening peak hour (4 to 7pm) is slightly earlier compared with the weekday afternoon peak (5 to 8pm). This shift in timing could be attributed to the altered schedules and leisure activities that individuals typically engage in during weekends and holidays. People might be more inclined to start their evening commutes earlier to accommodate social plans, family gatherings, or recreational activities that are common on non-working days.

10 Exploratory Spatial Data Analysis

Exploratory Spatial Data Analysis (ESDA) consists of descriptive techniques to discover the spatial distribution of data and identify outliers.

In this section, I will cover global spatial autocorrelation, which focuses on the overall trend, and local spatial autocorrelation, which focuses on identifying hot and cold spots in the data.

10.1 Global Spatial Autocorrelation

In this section, I will cover the computation of global spatial autocorrelation statistics and the complete spatial randomness test for global spatial autocorrelation. The goal is to understand whether the passenger trips generated by origin bus stops are evenly distributed across Singapore.
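Before applying these statistics to the study data, the mechanics can be sketched on a toy lattice (synthetic values, not the odbus data): cell2nb() of spdep builds rook neighbours for a regular grid, nb2listw() converts the neighbour list to row-standardised weights, and moran.test() computes Moran's I with a test against complete spatial randomness.

```r
library(spdep)

set.seed(42)
nb <- cell2nb(5, 5, type = "rook")   # neighbour list for a 5x5 lattice
lw <- nb2listw(nb, style = "W")      # row-standardised spatial weights
x  <- rnorm(25)                      # spatially random toy values

res <- moran.test(x, lw)             # Moran's I under randomisation
res$estimate["Moran I statistic"]
```

With spatially random values, the observed Moran's I should sit close to its expectation of -1/(n-1), and the test should not reject randomness.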

10.1.1 Spatial Weights Matrix

Before computing global spatial autocorrelation, we need to define the spatial neighbourhood using spatial weights. There are two common methods to compute spatial weights: contiguity-based and distance-based.

In the contiguity-based method, neighbours share a common boundary, and there are two ways of defining it: ROOK, by common edge, and QUEEN, by common edge or vertex, as illustrated below for square grids.

![Rook Neighbors](data/figure/Rook.png)

![Queen Neighbors](data/figure/Queen.png)

In a hexagonal grid, finding neighbours is straightforward: both ROOK and QUEEN yield the same result, as shown below.

![Hexagon Neighbors](data/figure/Hexagonal_Neighbors.png)

In the distance-based method, there are two approaches: fixed weighting, where grids are considered neighbours if they are within a specified distance of one another, and adaptive weighting, where each grid has the same specified number of neighbours.

If some hexagonal grids are isolated from one another, the contiguity-based method may not be appropriate, as it may yield many grids with no neighbours.
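The two distance-based approaches can be sketched with spdep on toy coordinates (hypothetical points, not the study data): dnearneigh() implements the fixed distance band, while knearneigh() plus knn2nb() implements adaptive k-nearest neighbours.

```r
library(spdep)

# Three toy points on a line, coordinates in metres (hypothetical data)
coords <- cbind(x = c(0, 300, 900), y = c(0, 0, 0))

# Fixed weighting: neighbours within a 500 m distance band
nb_fixed <- dnearneigh(coords, d1 = 0, d2 = 500)
card(nb_fixed)   # the third point is beyond 500 m, so it has no neighbour

# Adaptive weighting: every point gets exactly k = 1 nearest neighbour
nb_knn <- knn2nb(knearneigh(coords, k = 1))
card(nb_knn)     # no point is left without a neighbour
```

Note how the fixed band can leave isolated regions with zero neighbours, while the adaptive approach guarantees every region the same neighbour count.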

10.1.2 Contiguity Weight Matrix (QUEEN)

In the code chunk below, I will use poly2nb() of spdep package to compute contiguity weight matrices for weekday morning peak hour. This function builds a list of neighbours based on grids with contiguous boundaries. By default, Queen contiguity is applied.

wm_q <- poly2nb(total_trips_per_grid_wdmp, queen=TRUE)
summary(wm_q)
Neighbour list object:
Number of regions: 1492 
Number of nonzero links: 6714 
Percentage nonzero weights: 0.3016086 
Average number of links: 4.5 
12 regions with no links:
276 296 454 550 713 964 1030 1387 1477 1480 1484 1492
Link number distribution:

  0   1   2   3   4   5   6 
 12  40 105 206 285 358 486 
40 least connected regions:
1 7 22 38 98 166 183 184 185 191 207 214 253 257 260 551 595 629 683 695 719 738 755 771 855 990 1004 1005 1029 1069 1194 1436 1443 1454 1472 1473 1475 1478 1482 1491 with 1 link
486 most connected regions:
10 13 16 17 24 25 31 35 42 43 48 53 55 60 63 67 73 77 80 81 84 85 87 88 91 92 97 102 107 111 117 121 127 132 138 139 141 145 146 147 151 152 153 154 160 161 162 170 171 172 179 180 181 187 188 189 190 196 197 198 201 202 203 204 212 225 235 239 240 242 252 268 272 280 287 289 290 291 293 299 300 302 303 306 311 314 315 317 327 328 329 330 333 342 345 353 358 380 381 390 392 393 397 404 408 413 415 421 426 427 428 430 433 440 441 442 450 456 457 458 459 463 467 470 471 478 482 483 485 491 492 496 502 503 506 507 512 518 523 528 532 537 538 539 541 545 547 553 557 562 563 565 566 570 579 580 583 587 588 592 593 597 603 607 611 613 620 623 624 625 635 636 637 641 642 644 645 646 656 657 658 664 667 668 669 674 675 677 678 687 688 691 692 693 700 703 704 711 714 715 716 727 728 742 744 745 747 758 761 762 763 764 769 770 774 775 776 779 780 781 782 786 787 792 793 796 797 798 799 805 806 809 810 811 816 817 818 827 829 830 832 833 834 836 837 838 839 840 846 849 851 852 853 857 858 862 863 864 866 867 868 870 871 874 877 879 882 885 888 890 891 895 899 904 906 911 912 913 915 916 920 928 929 930 931 932 933 938 942 943 946 947 952 953 955 956 957 961 967 968 969 970 971 973 979 980 981 982 987 994 995 996 997 1007 1008 1009 1011 1012 1019 1020 1021 1025 1033 1034 1037 1039 1040 1045 1046 1047 1049 1050 1051 1052 1059 1061 1062 1063 1066 1072 1076 1083 1084 1085 1088 1089 1093 1094 1100 1103 1104 1105 1111 1116 1117 1118 1119 1124 1125 1127 1128 1129 1130 1131 1133 1139 1140 1141 1145 1146 1147 1149 1151 1152 1153 1154 1158 1159 1160 1161 1162 1166 1168 1171 1172 1173 1174 1175 1181 1182 1183 1184 1185 1186 1187 1190 1191 1197 1198 1199 1200 1201 1207 1213 1214 1215 1219 1224 1225 1231 1233 1234 1235 1240 1244 1245 1250 1251 1252 1256 1259 1261 1267 1276 1279 1280 1293 1298 1299 1300 1301 1303 1304 1305 1308 1309 1311 1317 1326 1328 1329 1337 1338 1340 1343 1344 1349 1352 1353 1354 1356 1360 1363 1365 1367 1370 1378 1380 1384 1389 1390 1392 1396 1397 1398 1399 1400 1405 
1406 1407 1408 1412 1413 1414 1418 1419 1421 1423 1425 1428 1429 1431 1432 1433 1434 1441 with 6 links

The summary report above shows that there are 1,492 hexagonal grids: 12 grids have no neighbours, 40 grids have only one neighbour, and the most connected grids have 6 links. The average number of links is 4.5.

10.1.3 Contiguity Weight Matrix (ROOK)

In the code chunk below, I will use poly2nb() of the spdep package to compute the contiguity weight matrix for the weekday morning peak hour, specifying queen = FALSE to compute rook contiguity.

wm_r <- poly2nb(total_trips_per_grid_wdmp, queen=FALSE)
summary(wm_r)
Neighbour list object:
Number of regions: 1492 
Number of nonzero links: 6714 
Percentage nonzero weights: 0.3016086 
Average number of links: 4.5 
12 regions with no links:
276 296 454 550 713 964 1030 1387 1477 1480 1484 1492
Link number distribution:

  0   1   2   3   4   5   6 
 12  40 105 206 285 358 486 
40 least connected regions:
1 7 22 38 98 166 183 184 185 191 207 214 253 257 260 551 595 629 683 695 719 738 755 771 855 990 1004 1005 1029 1069 1194 1436 1443 1454 1472 1473 1475 1478 1482 1491 with 1 link
486 most connected regions:
10 13 16 17 24 25 31 35 42 43 48 53 55 60 63 67 73 77 80 81 84 85 87 88 91 92 97 102 107 111 117 121 127 132 138 139 141 145 146 147 151 152 153 154 160 161 162 170 171 172 179 180 181 187 188 189 190 196 197 198 201 202 203 204 212 225 235 239 240 242 252 268 272 280 287 289 290 291 293 299 300 302 303 306 311 314 315 317 327 328 329 330 333 342 345 353 358 380 381 390 392 393 397 404 408 413 415 421 426 427 428 430 433 440 441 442 450 456 457 458 459 463 467 470 471 478 482 483 485 491 492 496 502 503 506 507 512 518 523 528 532 537 538 539 541 545 547 553 557 562 563 565 566 570 579 580 583 587 588 592 593 597 603 607 611 613 620 623 624 625 635 636 637 641 642 644 645 646 656 657 658 664 667 668 669 674 675 677 678 687 688 691 692 693 700 703 704 711 714 715 716 727 728 742 744 745 747 758 761 762 763 764 769 770 774 775 776 779 780 781 782 786 787 792 793 796 797 798 799 805 806 809 810 811 816 817 818 827 829 830 832 833 834 836 837 838 839 840 846 849 851 852 853 857 858 862 863 864 866 867 868 870 871 874 877 879 882 885 888 890 891 895 899 904 906 911 912 913 915 916 920 928 929 930 931 932 933 938 942 943 946 947 952 953 955 956 957 961 967 968 969 970 971 973 979 980 981 982 987 994 995 996 997 1007 1008 1009 1011 1012 1019 1020 1021 1025 1033 1034 1037 1039 1040 1045 1046 1047 1049 1050 1051 1052 1059 1061 1062 1063 1066 1072 1076 1083 1084 1085 1088 1089 1093 1094 1100 1103 1104 1105 1111 1116 1117 1118 1119 1124 1125 1127 1128 1129 1130 1131 1133 1139 1140 1141 1145 1146 1147 1149 1151 1152 1153 1154 1158 1159 1160 1161 1162 1166 1168 1171 1172 1173 1174 1175 1181 1182 1183 1184 1185 1186 1187 1190 1191 1197 1198 1199 1200 1201 1207 1213 1214 1215 1219 1224 1225 1231 1233 1234 1235 1240 1244 1245 1250 1251 1252 1256 1259 1261 1267 1276 1279 1280 1293 1298 1299 1300 1301 1303 1304 1305 1308 1309 1311 1317 1326 1328 1329 1337 1338 1340 1343 1344 1349 1352 1353 1354 1356 1360 1363 1365 1367 1370 1378 1380 1384 1389 1390 1392 1396 1397 1398 1399 1400 1405 
1406 1407 1408 1412 1413 1414 1418 1419 1421 1423 1425 1428 1429 1431 1432 1433 1434 1441 with 6 links

The summary report above shows that there are 1,492 hexagonal grids: 12 grids have no neighbours, 40 grids have only one neighbour, and the most connected grids have 6 links. The average number of links is 4.5.

Note: The rook and queen methods produce identical results, as shown by the computations above.
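The two criteria can be contrasted with a small base-R sketch on a hypothetical 3×3 grid of square cells (not the hexagon layer itself): queen counts any shared boundary point as a neighbour, while rook requires a shared edge. For hexagons the two always coincide, because adjacent hexagons share a full edge and never meet at only a corner, which is why the queen and rook summaries above are identical.

```r
# Hypothetical 3x3 grid of unit square cells, indexed by (row, col).
cells <- expand.grid(row = 1:3, col = 1:3)
n <- nrow(cells)

# Queen: neighbours share an edge OR a corner (Chebyshev distance 1).
# Rook : neighbours share an edge only (Manhattan distance 1).
queen_links <- 0
rook_links  <- 0
for (i in 1:n) {
  for (j in 1:n) {
    if (i == j) next
    dr <- abs(cells$row[i] - cells$row[j])
    dc <- abs(cells$col[i] - cells$col[j])
    if (max(dr, dc) == 1) queen_links <- queen_links + 1
    if (dr + dc == 1)     rook_links  <- rook_links + 1
  }
}
queen_links  # 40 directed links: diagonal contacts count as neighbours
rook_links   # 24 directed links: only edge contacts count
```

On square grids the two counts differ; on a hexagonal grid the diagonal-only case never occurs, so both criteria yield the same neighbour list.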

10.1.4 Visualising contiguity weights

Before visualising the weights, I need to reproject the coordinates to WGS84 (longitude-latitude) using st_transform() of the sf package.

total_trips_per_grid_wdmp$area_honeycomb_grid <- st_transform(total_trips_per_grid_wdmp$area_honeycomb_grid, "+proj=longlat +datum=WGS84")

Next, I will get the coordinates of the hexagonal grid centroids in longitude and latitude using st_coordinates() and st_centroid() of the sf package.

coords <- st_coordinates(st_centroid(total_trips_per_grid_wdmp$area_honeycomb_grid))
head(coords)
            X        Y
[1,] 103.6174 1.271424
[2,] 103.6196 1.275340
[3,] 103.6219 1.294920
[4,] 103.6241 1.275341
[5,] 103.6241 1.291005
[6,] 103.6241 1.298837

I will only plot queen contiguity as rook contiguity yields the same results.

plot(total_trips_per_grid_wdmp$area_honeycomb_grid, border = "lightgrey", main = "Queen and Rook Contiguity")
plot(wm_q, coords, pch = 19, cex = 0.6, add = TRUE, col = "red")

10.1.5 Fixed Distance Weight Matrix

The dnearneigh() function of the spdep package will be used to derive the distance-based weight matrix. This function identifies neighbours of the hexagonal grid centroid points by Euclidean distance, with lower and upper bounds controlled by the bounds argument, or by Great Circle distance in kilometres if the longlat argument is set to TRUE.

10.1.5.1 Determine the cut-off distance
  • Return a matrix with the indices of points belonging to the set of the k nearest neighbours of each other by using knearneigh() of spdep.

  • Convert the knn object returned by knearneigh() into a neighbours list of class nb with a list of integer vectors containing neighbour region number ids by using knn2nb().

  • Return the length of neighbour relationship edges by using nbdists() of spdep. The function returns in the units of the coordinates if the coordinates are projected, in km otherwise.

  • Remove the list structure of the returned object by using unlist().

k1 <- knn2nb(knearneigh(coords))
k1dists <- unlist(nbdists(k1, coords, longlat = TRUE))
summary(k1dists)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5000  0.5000  0.5000  0.5072  0.5000  4.5825 

The summary report shows that the largest first-nearest-neighbour distance is 4.5825 km (4,582.5 metres), so using a slightly larger value (4.6 km) as the upper threshold guarantees that every region has at least one neighbour.

total_trips_per_grid_wdmp$grid_id[match(max(k1dists), k1dists)]
[1] 1767

The code chunk above shows that the grid_id with the maximum distance to its nearest neighbour is 1767, which is the grid located near Johor.
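The logic of the cut-off can be sketched in base R on hypothetical planar points (not the bus stop data): compute each point's distance to its single nearest neighbour, then take the maximum of those distances as the smallest radius at which no point is left without a neighbour.

```r
set.seed(42)
pts <- cbind(runif(20), runif(20))   # 20 hypothetical points
d <- as.matrix(dist(pts))            # pairwise Euclidean distances
diag(d) <- Inf                       # ignore self-distances
nn1 <- apply(d, 1, min)              # 1st nearest-neighbour distance per point
cutoff <- max(nn1)                   # smallest radius guaranteeing >= 1 neighbour

# Every point now has at least one neighbour within the cutoff:
all(rowSums(d <= cutoff) >= 1)       # TRUE by construction
```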

10.1.5.2 Computing fixed distance weight matrix

Now, we will compute the distance weight matrix by using dnearneigh() as shown below.

wm_d4.6 <- dnearneigh(coords, 0, 4.6, longlat = TRUE)
wm_d4.6
Neighbour list object:
Number of regions: 1492 
Number of nonzero links: 236430 
Percentage nonzero weights: 10.62099 
Average number of links: 158.4651 

From the output, we see that the average number of links is 158.47. Such a large neighbourhood dilutes local variation and may skew the analysis.

Next, we will use str() to display the content of wm_d4.6 weight matrix.

str(wm_d4.6)
List of 1492
 $ : int [1:30] 2 3 4 5 6 7 8 9 10 11 ...
 $ : int [1:33] 1 3 4 5 6 7 8 9 10 11 ...
 $ : int [1:63] 1 2 4 5 6 7 8 9 10 11 ...
 $ : int [1:35] 1 2 3 5 6 7 8 9 10 11 ...
 $ : int [1:61] 1 2 3 4 6 7 8 9 10 11 ...
 $ : int [1:74] 1 2 3 4 5 7 8 9 10 11 ...
 $ : int [1:86] 1 2 3 4 5 6 8 9 10 11 ...
 $ : int [1:43] 1 2 3 4 5 6 7 9 10 11 ...
 $ : int [1:57] 1 2 3 4 5 6 7 8 10 11 ...
 $ : int [1:72] 1 2 3 4 5 6 7 8 9 11 ...
 $ : int [1:91] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:52] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:69] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:84] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:94] 2 3 4 5 6 7 8 9 10 11 ...
 $ : int [1:66] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:82] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:94] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:98] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:93] 3 5 6 7 8 9 10 11 12 13 ...
 $ : int [1:85] 3 5 6 7 9 10 11 13 14 15 ...
 $ : int [1:43] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:60] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:77] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:91] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:99] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:99] 2 3 4 5 6 7 8 9 10 11 ...
 $ : int [1:94] 3 5 6 7 9 10 11 12 13 14 ...
 $ : int [1:84] 3 5 6 7 10 11 13 14 15 17 ...
 $ : int [1:71] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:89] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:99] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:103] 2 3 4 5 6 7 8 9 10 11 ...
 $ : int [1:99] 3 5 6 7 8 9 10 11 12 13 ...
 $ : int [1:93] 3 5 6 7 9 10 11 13 14 15 ...
 $ : int [1:81] 3 6 7 10 11 14 15 17 18 19 ...
 $ : int [1:63] 7 11 15 18 19 20 21 26 27 28 ...
 $ : int [1:48] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:65] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:84] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:97] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:101] 3 5 6 7 9 10 11 12 13 14 ...
 $ : int [1:92] 3 5 6 7 10 11 13 14 15 17 ...
 $ : int [1:78] 6 7 11 14 15 18 19 20 21 25 ...
 $ : int [1:60] 7 11 15 19 20 21 26 27 28 29 ...
 $ : int [1:96] 1 2 3 4 5 6 7 8 9 10 ...
 $ : int [1:108] 3 5 6 7 8 9 10 11 12 13 ...
 $ : int [1:100] 3 5 6 7 9 10 11 13 14 15 ...
 $ : int [1:90] 6 7 10 11 14 15 17 18 19 20 ...
 $ : int [1:74] 7 11 15 18 19 20 21 26 27 28 ...
 $ : int [1:113] 3 5 6 7 8 9 10 11 12 13 ...
 $ : int [1:109] 3 5 6 7 9 10 11 12 13 14 ...
 $ : int [1:101] 3 6 7 10 11 13 14 15 17 18 ...
 $ : int [1:69] 11 15 19 20 21 26 27 28 29 33 ...
 $ : int [1:110] 3 5 6 7 10 11 13 14 15 16 ...
 $ : int [1:98] 6 7 11 14 15 17 18 19 20 21 ...
 $ : int [1:84] 7 11 15 18 19 20 21 26 27 28 ...
 $ : int [1:120] 3 5 6 7 8 9 10 11 12 13 ...
 $ : int [1:119] 3 5 6 7 9 10 11 13 14 15 ...
 $ : int [1:109] 6 7 10 11 14 15 17 18 19 20 ...
 $ : int [1:97] 7 11 15 18 19 20 21 25 26 27 ...
 $ : int [1:123] 3 5 6 7 9 10 11 12 13 14 ...
 $ : int [1:120] 3 6 7 10 11 13 14 15 17 18 ...
 $ : int [1:110] 7 11 14 15 18 19 20 21 25 26 ...
 $ : int [1:125] 3 4 5 6 7 8 9 10 11 12 ...
 $ : int [1:127] 3 5 6 7 10 11 13 14 15 16 ...
 $ : int [1:122] 6 7 11 14 15 17 18 19 20 21 ...
 $ : int [1:122] 3 4 5 6 7 8 9 10 11 12 ...
 $ : int [1:129] 3 5 6 7 8 9 10 11 12 13 ...
 $ : int [1:133] 3 5 6 7 9 10 11 13 14 15 ...
 $ : int [1:129] 6 7 10 11 14 15 17 18 19 20 ...
 $ : int [1:122] 7 11 15 18 19 20 21 25 26 27 ...
 $ : int [1:129] 3 5 6 7 8 9 10 11 12 13 ...
 $ : int [1:136] 3 5 6 7 9 10 11 12 13 14 ...
 $ : int [1:131] 7 11 14 15 18 19 20 21 25 26 ...
 $ : int [1:128] 3 5 6 7 8 9 10 11 12 13 ...
 $ : int [1:139] 3 5 6 7 9 10 11 12 13 14 ...
 $ : int [1:141] 6 7 10 11 13 14 15 16 17 18 ...
 $ : int [1:139] 7 11 14 15 17 18 19 20 21 25 ...
 $ : int [1:137] 3 5 6 7 9 10 11 12 13 14 ...
 $ : int [1:145] 6 7 10 11 13 14 15 16 17 18 ...
 $ : int [1:144] 7 11 14 15 17 18 19 20 21 24 ...
 $ : int [1:133] 3 5 6 7 9 10 11 12 13 14 ...
 $ : int [1:143] 6 7 10 11 13 14 15 16 17 18 ...
 $ : int [1:147] 7 11 14 15 17 18 19 20 21 24 ...
 $ : int [1:126] 5 6 7 9 10 11 12 13 14 15 ...
 $ : int [1:140] 6 7 10 11 13 14 15 16 17 18 ...
 $ : int [1:150] 7 11 14 15 17 18 19 20 21 24 ...
 $ : int [1:152] 11 15 18 19 20 21 25 26 27 28 ...
 $ : int [1:130] 10 11 13 14 15 16 17 18 19 20 ...
 $ : int [1:144] 11 14 15 17 18 19 20 21 24 25 ...
 $ : int [1:152] 11 15 18 19 20 21 25 26 27 28 ...
 $ : int [1:155] 15 19 20 21 26 27 28 29 32 33 ...
 $ : int [1:117] 13 14 16 17 18 19 23 24 25 26 ...
 $ : int [1:134] 14 15 17 18 19 20 24 25 26 27 ...
 $ : int [1:150] 15 18 19 20 21 25 26 27 28 29 ...
 $ : int [1:158] 15 19 20 21 26 27 28 29 32 33 ...
 $ : int [1:104] 16 17 18 23 24 25 26 30 31 32 ...
 $ : int [1:143] 18 19 20 25 26 27 28 31 32 33 ...
  [list output truncated]
 - attr(*, "class")= chr "nb"
 - attr(*, "region.id")= chr [1:1492] "1" "2" "3" "4" ...
 - attr(*, "call")= language dnearneigh(x = coords, d1 = 0, d2 = 4.6, longlat = TRUE)
 - attr(*, "dnn")= num [1:2] 0 4.6
 - attr(*, "bounds")= chr [1:2] "GE" "LE"
 - attr(*, "nbtype")= chr "distance"
 - attr(*, "sym")= logi TRUE
par(mfrow = c(1,2))
plot(total_trips_per_grid_wdmp$area_honeycomb_grid, border = "lightgrey", main = "1st nearest neighbours")
plot(k1, coords, add = TRUE, col = "red", length = 0.88)

plot(total_trips_per_grid_wdmp$area_honeycomb_grid, border = "lightgrey", main = "Distance Link")
plot(wm_d4.6, coords, add = TRUE, pch = 19, cex = 0.6)

Due to the high number of links, the graphs are very dense and difficult to interpret. However, we can still make some observations:

  • The charts illustrate a characteristic of the fixed distance weight matrix: hexagonal grids of bus stop origins in the centre of Singapore tend to have more neighbours, while grids at the edges of Singapore, such as those near Johor and Tanah Merah Coast, have fewer.

  • Based on the charts, the geographical regions of Singapore are highly connected by bus.

10.1.6 Adaptive Distance Weight Matrix

To overcome the uneven distribution of neighbours in the fixed distance weight matrix, we can directly control the number of neighbours using k-nearest neighbours, as shown in the code chunk below.

I will set k = 6, i.e. every hexagonal grid will have exactly 6 neighbours.

knn6 <- knn2nb(knearneigh(coords, k=6))
knn6
Neighbour list object:
Number of regions: 1492 
Number of nonzero links: 8952 
Percentage nonzero weights: 0.4021448 
Average number of links: 6 
Non-symmetric neighbours list
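The "Non-symmetric neighbours list" line is expected for k-nearest-neighbour graphs: grid A can be among B's k nearest neighbours without B being among A's. A minimal base-R illustration with three hypothetical points on a line:

```r
# Three points on a line at x = 0, 1 and 3.
x <- c(0, 1, 3)
d <- abs(outer(x, x, "-"))    # pairwise distances
diag(d) <- Inf                # ignore self-distances
nn <- apply(d, 1, which.min)  # each point's single nearest neighbour
nn         # 2 1 2: point 3's nearest neighbour is point 2 ...
nn[nn[3]]  # 1: ... but point 2's nearest is point 1, so the relation is asymmetric
```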

10.1.7 Plotting distance based neighbours

par(mfrow = c(1,2))
plot(total_trips_per_grid_wdmp$area_honeycomb_grid, border = "lightgrey", main = "6 nearest neighbours")
plot(knn6, coords, add = TRUE, col = "red", length = 0.88)

plot(total_trips_per_grid_wdmp$area_honeycomb_grid, border = "lightgrey", main = "Distance Link w KNN")
plot(knn6, coords, add = TRUE, pch = 19, cex = 0.6)

10.1.8 Determining Which Weights Matrix to Use

The choice of spatial weight matrix depends on the geographical area of interest and the focus of the study. Contiguity-based matrices are usually preferred for hexagonal grids of uniform size, because they are well suited to regular grids where neighbouring units share common boundaries. In my case, however, some hexagonal grids have no neighbours at all, which makes the contiguity-based approach unsuitable. The fixed distance approach has its own disadvantage: some regions have only 1 neighbour while others have around 158. Therefore, I will use a distance-based method with an adaptive distance spatial weight matrix, which gives every grid the same number of neighbours.

10.1.9 Row-Standardised Weights Matrix

After selecting the weight matrix, I will now assign weights to each neighbouring polygon. Each neighbouring polygon is assigned an equal weight (style="W") by assigning the fraction 1/(number of neighbours) to each neighbouring area. This is also known as a row-standardised matrix, where each row of the matrix sums to 1.

rswm_knn6 <- nb2listw(knn6,
                   style = "W",
                   zero.policy = TRUE)
rswm_knn6
Characteristics of weights list object:
Neighbour list object:
Number of regions: 1492 
Number of nonzero links: 8952 
Percentage nonzero weights: 0.4021448 
Average number of links: 6 
Non-symmetric neighbours list

Weights style: W 
Weights constants summary:
     n      nn   S0    S1       S2
W 1492 2226064 1492 462.5 6048.333

I will be using the row-standardised weight matrix for the next part of the analysis.

Notes:

The input of nb2listw() must be an object of class nb. The function has two major arguments, namely style and zero.policy.

  • style can take values "W", "B", "C", "U", "minmax" and "S". B is the basic binary coding, W is row standardised (sums over all links to n), C is globally standardised (sums over all links to n), U is equal to C divided by the number of neighbours (sums over all links to unity), while S is the variance-stabilizing coding scheme proposed by Tiefelsdorf et al. 1999, p. 167-168 (sums over all links to n).

  • If zero.policy is set to TRUE, weights vectors of zero length are inserted for regions without neighbours in the neighbours list. These will in turn generate lag values of zero, equivalent to the sum of products of the zero row t(rep(0, length=length(neighbours))) %*% x, for an arbitrary numerical vector x of length length(neighbours). The spatially lagged value of x for the zero-neighbour region will then be zero, which may (or may not) be a sensible choice.
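Mechanically, style = "W" simply divides each row of the binary neighbour matrix by its row sum. A minimal base-R sketch on a hypothetical 4-region binary matrix:

```r
# Hypothetical binary (style = "B") neighbour matrix for 4 regions.
B <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 1,
              1, 1, 0, 1,
              0, 1, 1, 0), nrow = 4, byrow = TRUE)

# Style "W": each neighbour gets 1 / (number of neighbours), so every row sums to 1.
W <- B / rowSums(B)
rowSums(W)   # 1 1 1 1
```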

10.1.10 Computing Global Spatial Autocorrelation Statistics

In this sub-section, I will use two methods, Moran's I and Geary's C, to test the following hypotheses:

  • H0: The observed spatial pattern of values is equally likely as any other spatial pattern, i.e. the data is randomly dispersed and there is no spatial pattern.

  • H1: The data is more spatially clustered than expected by chance alone.

Moran's I

I will perform the Moran's I statistical test by using moran.test() of spdep. Moran's I describes how features differ from the values in the study area as a whole. The Moran's I statistic ranges from -1 to 1. If Moran's I is:

  • positive (I>0): Clustered, observations tend to be similar

  • negative (I<0): Dispersed, observations tend to be dissimilar

  • approximately zero: observations arranged randomly over space

The code chunk below performs the Moran's I test on the passenger trips generated by origin.

moran.test(total_trips_per_grid_wdmp$total_trips,
           listw = rswm_knn6,
           zero.policy = TRUE,
           na.action = na.omit)

    Moran I test under randomisation

data:  total_trips_per_grid_wdmp$total_trips  
weights: rswm_knn6    

Moran I statistic standard deviate = 14.627, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.2066719242     -0.0006706908      0.0002009369 

Since the p-value < 0.05, I have sufficient statistical evidence to reject the null hypothesis at the 95% confidence level. This means the data is more spatially clustered than expected by chance alone. Since the Moran's I statistic is greater than 0, the observations are clustered and tend to be similar to their neighbours.
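As a sanity check on this interpretation, the statistic itself can be written out in a few lines of base R. This is only a sketch of the formula on hypothetical data, not a replacement for moran.test():

```r
# Moran's I = (n / S0) * sum_ij w_ij * z_i * z_j / sum_i z_i^2, where z = x - mean(x)
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * as.numeric(t(z) %*% W %*% z) / sum(z^2)
}

# Six hypothetical regions on a ring; each region's two ring neighbours
# get weight 0.5 (row-standardised).
n <- 6
W <- matrix(0, n, n)
for (i in 1:n) {
  W[i, ifelse(i == 1, n, i - 1)] <- 0.5
  W[i, ifelse(i == n, 1, i + 1)] <- 0.5
}
morans_i(c(10, 10, 10, 2, 2, 2), W)  # 0.333: similar values sit next to each other
morans_i(c(10, 2, 10, 2, 10, 2), W)  # -1: dissimilar values alternate
```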

Computing Monte Carlo Moran's I

If there are doubts that the assumptions of Moran's I (normality and randomisation) hold, a Monte Carlo simulation can be used to perform a permutation test for Moran's I.

The permutation test consists of randomly reassigning the attribute values to the cells under the assumption of no spatial pattern. This random assignment is repeated n times, and Moran's I is computed each time to create an empirical distribution of Moran's I under H0.

The code chunk below performs the permutation test for the Moran's I statistic by using moran.mc() of the spdep package with a total of 1,000 simulations.

set.seed(1234)
bperm= moran.mc(total_trips_per_grid_wdmp$total_trips, 
                listw=rswm_knn6, 
                nsim=999, 
                zero.policy = TRUE, 
                na.action=na.omit)
bperm

    Monte-Carlo simulation of Moran I

data:  total_trips_per_grid_wdmp$total_trips 
weights: rswm_knn6  
number of simulations + 1: 1000 

statistic = 0.20667, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater

Since the p-value is < 0.05, we have sufficient statistical evidence to reject the null hypothesis at the 95% level of confidence. This means that data is more spatially clustered than expected by chance alone.

Visualising Monte Carlo Moran's I

The code chunk below will be used to plot the distribution of the simulated Moran's I test statistics as a histogram using ggplot2 from the tidyverse package.
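A sketch of such a chunk, assuming bperm from the previous chunk is available: moran.mc() stores the 999 simulated statistics (followed by the observed value as the last element) in bperm$res, and the observed Moran's I in bperm$statistic.

```r
library(ggplot2)

# The first 999 elements of bperm$res are the simulated Moran's I values;
# the 1000th is the observed statistic appended by moran.mc().
sim_df <- data.frame(sim_moran = bperm$res[1:999])

ggplot(sim_df, aes(x = sim_moran)) +
  geom_histogram(bins = 30, fill = "grey80", colour = "black") +
  geom_vline(xintercept = bperm$statistic,
             colour = "red", linetype = "dashed") +
  labs(title = "Distribution of Simulated Moran's I under H0",
       x = "Simulated Moran's I",
       y = "Frequency")
```

The dashed red line marks the observed Moran's I (0.2067); the fact that it lies far in the right tail of the simulated distribution is the visual counterpart of the p-value of 0.001.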

Geary's C

Geary's C considers the differences between neighbouring observations, i.e. it describes how features differ from their immediate neighbours. The statistic is non-negative: it ranges from 0 to an undefined value above 1. If Geary's C is:

  • Large (c>1): Dispersed, observations tend to be dissimilar

  • Small (c<1): Clustered, observations tend to be similar

  • c = 1: observations arranged randomly over space

The code chunk below performs Geary's C test for spatial autocorrelation on the passenger trips generated by origin using geary.test() of spdep.

geary.test(total_trips_per_grid_wdmp$total_trips, listw = rswm_knn6)

    Geary C test under randomisation

data:  total_trips_per_grid_wdmp$total_trips 
weights: rswm_knn6 

Geary C statistic standard deviate = 7.4283, p-value = 5.502e-14
alternative hypothesis: Expectation greater than statistic
sample estimates:
Geary C statistic       Expectation          Variance 
     0.8176559658      1.0000000000      0.0006025712 

Since the p-value < 0.05, there is sufficient statistical evidence to reject the null hypothesis at the 95% confidence level. This means the data is more spatially clustered than expected by chance alone. The Geary's C statistic is less than 1, suggesting that clusters are present. This finding is consistent with the results of the Global Moran's I test in the previous section.
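Like Moran's I, Geary's C can be written out in base R as a sanity check; a sketch on hypothetical ring data (not the bus data):

```r
# Geary's C = ((n - 1) / (2 * S0)) * sum_ij w_ij * (x_i - x_j)^2 / sum_i (x_i - xbar)^2
gearys_c <- function(x, W) {
  num <- sum(W * outer(x, x, "-")^2)
  ((length(x) - 1) / (2 * sum(W))) * num / sum((x - mean(x))^2)
}

# Six hypothetical regions on a ring, row-standardised weights (0.5 per neighbour).
n <- 6
W <- matrix(0, n, n)
for (i in 1:n) {
  W[i, ifelse(i == 1, n, i - 1)] <- 0.5
  W[i, ifelse(i == n, 1, i + 1)] <- 0.5
}
gearys_c(c(10, 10, 10, 2, 2, 2), W)  # 0.556 < 1: clustered
gearys_c(c(10, 2, 10, 2, 10, 2), W)  # 1.667 > 1: dispersed
```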

Computing Monte Carlo Geary's C

Similar to Moran's I, a Monte Carlo simulation is used to perform a permutation test for Geary's C. The code chunk below performs the permutation test for the Geary's C statistic by using geary.mc() of the spdep package with a total of 1,000 simulations.

set.seed(1234)
bperm_func = geary.mc(total_trips_per_grid_wdmp$total_trips,
                 listw = rswm_knn6,
                 nsim = 999)
bperm_func

    Monte-Carlo simulation of Geary C

data:  total_trips_per_grid_wdmp$total_trips 
weights: rswm_knn6 
number of simulations + 1: 1000 

statistic = 0.81766, observed rank = 1, p-value = 0.001
alternative hypothesis: greater

Since the p-value is 0.001 < 0.05, there is sufficient statistical evidence to reject the null hypothesis at the 95% level of confidence. This means that data is more spatially clustered than expected by chance alone.

11 Local Spatial Autocorrelation Statistics

From the results above, I can conclude that the passenger trips generated by origin exhibit statistically significant positive global spatial autocorrelation. The next step is to locate the clusters using local spatial autocorrelation statistics.

12 Task 2: Local Indicators of Spatial Association (LISA) Analysis

12.1 Deriving Contiguity Spatial Weights: Queen's Method

wm_q <- total_trips_per_grid_wdmp %>%
  mutate(nb = st_contiguity(area_honeycomb_grid),
         wt = st_weights(nb, style = "W"),
         .before = 1)

Local Indicators of Spatial Association or LISA are statistics that evaluate the existence of clusters in the spatial arrangement of a given variable.

In this section, I will apply an appropriate Local Indicator of Spatial Association (LISA), in particular local Moran's I, to detect clusters and/or outliers in the passenger trips generated by origin at the hexagon level.

The code chunks below are used to compute local Moran’s I of Total Trips at the hexagonal grid level.

fips <- order(total_trips_per_grid_wdmp$grid_id)
localMI <- localmoran(total_trips_per_grid_wdmp$total_trips, rswm_knn6)
head(localMI)
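To make the quantity concrete, here is a base-R sketch of the textbook local Moran's I formula on hypothetical ring data (not the bus data). The local values decompose the global statistic: with row-standardised weights, their mean equals global Moran's I.

```r
# Local Moran's I_i = (z_i / m2) * sum_j w_ij * z_j, with m2 = sum(z^2) / n
local_moran <- function(x, W) {
  z  <- x - mean(x)
  m2 <- sum(z^2) / length(x)
  (z / m2) * as.numeric(W %*% z)
}

# Six hypothetical regions on a ring, row-standardised weights (0.5 per neighbour).
n <- 6
W <- matrix(0, n, n)
for (i in 1:n) {
  W[i, ifelse(i == 1, n, i - 1)] <- 0.5
  W[i, ifelse(i == n, 1, i + 1)] <- 0.5
}
Ii <- local_moran(c(10, 10, 10, 2, 2, 2), W)
Ii        # 0 1 0 0 1 0: high inside each cluster, zero at the cluster boundaries
mean(Ii)  # 0.333: equals global Moran's I for these row-standardised weights
```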

Reproducibility note: to reproduce this analysis for a different month or time period, change the file supplied to read_csv() at the start of the analysis and re-run the chunks.